Abstract: We study the problem of clustering with relative
constraints, where each constraint specifies relative similarities
among instances. In particular, each constraint $(x_i, x_j, x_k)$ is
acquired by posing a query: is instance $x_i$ more similar to $x_j$
than to $x_k$? We consider the scenario where answers to such
queries are based on an underlying (but unknown) class concept, 
which we aim to discover via clustering. Different from most 
existing methods that only consider constraints derived from 
yes and no answers, we also incorporate don’t know responses. 
We introduce a Discriminative Clustering method with Relative
Constraints (DCRC) which assumes a natural probabilistic relationship
between instances, their underlying cluster memberships,
and the observed constraints. The objective is to maximize the 
model likelihood given the constraints, and in the meantime 
enforce cluster separation and cluster balance by also making use 
of the unlabeled instances. We evaluated the proposed method 
using constraints generated from ground-truth class labels, and 
from (noisy) human judgments from a user study. Experimental 
results demonstrate: 1) the usefulness of relative constraints, 
in particular when don’t know answers are considered; 2) the 
improved performance of the proposed method over state-of-the- 
art methods that utilize either relative or pairwise constraints; 
and 3) the robustness of our method in the presence of noisy 
constraints, such as those provided by human judgement. 

I. Introduction 

Unsupervised clustering can be improved with the aid of
side information for the task at hand. In general, side information
refers to knowledge beyond the instances themselves that can
help infer the underlying instance-to-cluster assignments.
One common and useful type of side information has been 
represented in the form of instance-level constraints that 
expose instance-level relationships. 

Previous work has primarily focused on the use of pairwise
constraints (e.g., m-jni), where a pair of instances
is indicated to belong to the same cluster by a Must-Link 
(ML) constraint or to different clusters by a Cannot-Link 
(CL) constraint. More recently, various studies lfl2l — IfTTl have 
suggested that domain knowledge can also be incorporated in 
the form of relative comparisons or relative constraints, where 
each constraint specifies whether instance $x_i$ is more similar
to $x_j$ than to $x_k$.

We were motivated to focus on relative constraints for a 
couple of reasons. First, the labeling (proper identification) 
of relative constraints by humans appears to be more reliable 
than that of pairwise constraints. Research in psychology has 
revealed that people are often inaccurate in making absolute 
judgments (required for pairwise constraints), but they are 
more trustworthy when judging comparatively ED. Consider
one of our applications, where we would like to form clusters
of bird song syllables based on spectrogram segments from
recorded sounds. Figures 1(a) and 1(b) show examples of the
two types of constraints/questions considered. In the examples, 
syllable 1 in both figures and syllable 3 in 1(b) are from the 
same singing pattern and syllable 2 in both figures belongs to 
a different one. From the figures, it is apparent that making 
an absolute judgment for the pairwise constraint in 1(a) is 
more difficult. In contrast, the comparative question for labeling
the relative constraint in 1(b) is much easier to answer.
Second, since each relative constraint includes information
about three instances, it tends to be more informative than
a pairwise constraint (even when several pairwise constraints
are considered). This is formally characterized in Section II-A.

In the area of learning from relative constraints, most work 
uses metric learning approaches (El-El. Such approaches 
assume that there is an underlying metric that determines the 
outcome of the similarity comparisons, and the goal is to 
learn such a metric. The learned metric is often later used for 
clustering (e.g., via Kmeans or related approaches). In practice, 
however, we may not have access to an oracle metric. Often the 
constraints are provided in a way that instances from the same 
class are considered more similar than those from different 
classes. This paper explicitly considers such scenarios where 
constraints are provided based on the underlying class concept. 
Unlike the metric-based approaches, we aim to directly infer 
an optimal clustering of the data using the provided relative 
comparisons, without requiring explicit metric learning. 

Formally, we regard each constraint $(x_i, x_j, x_k)$ as being
obtained by asking: is $x_i$ more similar to $x_j$ than to $x_k$? The
answer is provided by a user/oracle based on the underlying
instance clusters. In particular, a yes answer is given if $x_i$
and $x_j$ are believed to belong to the same cluster while $x_k$ is
believed to be from a different one. Similarly, the answer will
be no if it is believed that $x_i$ and $x_k$ are in the same cluster,
which is different from the one containing $x_j$. Note that for
some triplets, it may not be possible to provide a yes or no
answer, for example, if the three instances belong to the same
cluster, as shown in Figure 1(c), or if each of them belongs
to a different cluster, as shown in Figure 1(d). Such cases
have been largely ignored by prior studies. Here, we allow the
user to provide a don’t know answer (dnk) when yes/no cannot
be determined. Such dnk’s not only allow for improved
labeling flexibility, but also provide useful information about
instance clusters that can help improve clustering, as will be
demonstrated in Section II-A and the experiments.

In this work, we introduce a discriminative clustering
method, DCRC, that learns from relative constraints with yes,
no, or dnk labels (Section III). DCRC uses a probabilistic
model that naturally connects the instances, their underlying
cluster memberships, and the observed constraints. Based
on this model, we present a maximum-likelihood objective
with additional terms enforcing cluster separation and cluster
balance. Variational EM is used to find approximate solutions
(Section IV). In the experiments (Section V), we first evaluate
our method on both UCI and additional real-world datasets
with simulated noise-free constraints generated from ground-truth
class labels. The results demonstrate the usefulness of
relative constraints including don’t know answers, and the
performance advantage of our method over current state-of-the-art
methods for both relative and pairwise constraints. We
also evaluate our method with human-labeled noisy constraints
collected from a user study, and the results show that our method
is more robust to noisy constraints than existing methods.

Fig. 1. Examples of labeling pairwise vs. relative constraints from the Birdsong data. Panels: (a) pairwise constraint; (b), (c), (d) relative constraints. Labeling question for (a): Do syllable 1 and syllable 2 belong to the same
cluster? Labeling question for (b), (c), and (d): Is syllable 1 more similar to syllable 2 than to syllable 3? (a) and (b) show cases where relative constraints
are easier to label. The cases in (c) and (d) motivate the introduction of a don’t know answer for relative constraints.

II. Problem Analysis 

In this section, we first compare the cluster label information 
obtained by querying different types of constraints, analyzing 
the usefulness of relative constraints. Then we formally state 
the problem. 

A. Information from Constraints 

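Here we provide a qualitative analysis with a simplified
but illustrative example. Suppose we have $N$ i.i.d. instances
$\{x_i\}_{i=1}^{N}$ sampled from $K$ clusters with even prior $1/K$.
Consider a triplet $(x_{t_1}, x_{t_2}, x_{t_3})$ and a pair $(x_{b_1}, x_{b_2})$. Let
$Y_t = [y_{t_1}, y_{t_2}, y_{t_3}]^T$ and $Y_b = [y_{b_1}, y_{b_2}]^T$ be their corresponding
cluster labels. Let $l_t \in \{\text{yes}, \text{no}, \text{dnk}\}$ and $l'_b \in \{\text{ML}, \text{CL}\}$
be the labels for the relative and pairwise constraints, respectively.
In this example they are determined by

$$
l_t = \begin{cases}
\text{yes}, & \text{if } y_{t_1} = y_{t_2},\ y_{t_1} \neq y_{t_3} \\
\text{no}, & \text{if } y_{t_1} = y_{t_3},\ y_{t_1} \neq y_{t_2} \\
\text{dnk}, & \text{o.w.}
\end{cases}
\qquad (1)
$$

$$
l'_b = \begin{cases}
\text{ML}, & \text{if } y_{b_1} = y_{b_2} \\
\text{CL}, & \text{if } y_{b_1} \neq y_{b_2}.
\end{cases}
\qquad (2)
$$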
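As a small illustration of Eqs. (1) and (2), the following sketch assigns noise-free constraint labels from known cluster labels. The function names `label_triplet` and `label_pair` are hypothetical and chosen only for this example.

```python
def label_triplet(y1, y2, y3):
    """Noise-free relative-constraint label for cluster labels (y_t1, y_t2, y_t3), as in Eq. (1)."""
    if y1 == y2 and y1 != y3:
        return "yes"   # t1 and t2 share a cluster, t3 is elsewhere
    if y1 == y3 and y1 != y2:
        return "no"    # t1 and t3 share a cluster, t2 is elsewhere
    return "dnk"       # all other configurations


def label_pair(y1, y2):
    """Noise-free pairwise-constraint label for cluster labels (y_b1, y_b2), as in Eq. (2)."""
    return "ML" if y1 == y2 else "CL"
```

Note that a triplet such as (0, 1, 1), where only the second and third instances share a cluster, is labeled dnk under this rule.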


We can derive the mutual information between a relative
constraint and the associated instance cluster labels as (see
Appendix for the derivation)

$$
I(Y_t; l_t) = 2\log K - (1 - P_{dnk})\log(K - 1) - P_{dnk}\log[K^2 - 2(K - 1)], \qquad (3)
$$

and that for a pairwise constraint as

$$
I(Y_b; l'_b) = \log K - P_{CL}\log(K - 1), \qquad (4)
$$

where $P_{dnk} = 1 - 2(K - 1)/K^2$ and $P_{CL} = 1 - 1/K$.

Fig. 2. Mutual information between instance cluster labels and constraint
labels as a function of the number of clusters.

Figure 2 plots the values of Eqs. (3) and (4) as a function
of the number of clusters $K$. Comparing the curves for one
relative constraint and one pairwise constraint, we see that, in the
absence of other information, a relative constraint provides
more information. One might argue that labeling a triplet
requires inspecting more instances than labeling a pair, making
this comparison unfair. To address this bias, we compare the
information gain from the two types of constraints with the
same number of instances, namely, comparing the values of
two relative constraints with that of three pairwise constraints,
both involving six instances. In Figure 2 we see again that
relative constraints are more informative.

Another aspect worth evaluating is the motivation behind 
explicitly using dnk constraints. In prior work on learning from 
relative constraints, the constraints are typically generated by 
randomly selecting triplets and producing constraints based
on their class labels. If a triplet cannot be definitively labeled
with yes or no, the resulting constraint is not employed
by the learning algorithm (it is ignored). Such methods
by construction do not use the information provided by dnk
answers. However, it is possible to show that in general dnk’s
can provide information about instance labels. If dnk’s are
ignored, the mutual information can be computed by replacing
$H(Y_t \mid l_t = \text{dnk})$ with $H(Y_t)$, meaning that the dnk’s are not
informative about the instance labels. In this case, we have

$$
I'(Y_t; l_t) = 2(1 - P_{dnk})\log K - (1 - P_{dnk})\log(K - 1). \qquad (5)
$$

Comparing the curve for one relative YN constraint (which ignores
dnk) with that of one relative constraint in Figure 2, we see a clear
gap between using and not using dnk constraints, implying that
dnk constraints are informative. Additionally, the number of
dnk constraints is usually large, especially when the number
of clusters is large. Consider randomly selecting triplets from
clusters of equal size. There is a 50% chance of acquiring
a dnk constraint in two-cluster problems, and the chance
increases to 78% in eight-cluster problems. The information
provided by such a large number of dnk constraints is substantial.
Hence, we believe it will be beneficial to explicitly employ and
model dnk constraints.

Fig. 3. The dependencies between three instances $(x_{t_1}, x_{t_2}, x_{t_3})$, their
cluster labels $(y_{t_1}, y_{t_2}, y_{t_3})$, and the constraint label $l_t$.
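The quantities discussed in this subsection can be checked numerically. The sketch below evaluates Eqs. (3), (4), and (5) and the dnk probability $P_{dnk}$ for a given $K$. The function names are hypothetical, and natural logarithms are assumed (the choice of base only rescales all values).

```python
import math


def p_dnk(K):
    """Probability that a random triplet from K equally likely clusters is labeled dnk."""
    return 1.0 - 2.0 * (K - 1) / K**2


def mi_relative(K):
    """Eq. (3): information from one relative constraint with yes/no/dnk labels."""
    p = p_dnk(K)
    return (2 * math.log(K) - (1 - p) * math.log(K - 1)
            - p * math.log(K**2 - 2 * (K - 1)))


def mi_pairwise(K):
    """Eq. (4): information from one pairwise (ML/CL) constraint."""
    p_cl = 1.0 - 1.0 / K
    return math.log(K) - p_cl * math.log(K - 1)


def mi_relative_ignoring_dnk(K):
    """Eq. (5): information from one relative constraint when dnk answers are discarded."""
    p = p_dnk(K)
    return 2 * (1 - p) * math.log(K) - (1 - p) * math.log(K - 1)
```

For example, `p_dnk(2)` evaluates to 0.5 and `p_dnk(8)` to 0.78125, matching the 50% and 78% figures quoted above.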


B. Problem Statement 

Let $X = [x_1, \ldots, x_N]^T$ be the given data, where each $x_i \in
\mathbb{R}^d$ and $d$ is the feature dimension. Let $Y = [y_1, \ldots, y_N]^T$
be the hidden cluster label vector, where $y_i$ is the label of
$x_i$. With slight abuse of notation, we use $\{(t_1, t_2, t_3)\}_{t=1}^{M}$ to
denote the index set of $M$ triplets, representing $M$ relative
constraints. Each $(t_1, t_2, t_3)$ contains the indices of the three
instances in the $t$-th constraint. Let $L = [l_1, \ldots, l_M]^T$ be
the constraint label vector, where $l_t \in \{\text{yes}, \text{no}, \text{dnk}\}$ is the
label of $(x_{t_1}, x_{t_2}, x_{t_3})$. Each $l_t$ specifies the answer to the
question: is $x_{t_1}$ more similar to $x_{t_2}$ than to $x_{t_3}$? Our goal is to
partition the data into $K$ clusters such that similar instances
are assigned to the same cluster, while respecting the given
constraints. In this paper, we assume that $K$ is pre-specified.

In the following, we will use $I_t = \{t_1, t_2, t_3\}$ to denote
the set of indices in the $t$-th triplet, use $I$ to index all
the distinct instances involved in the constraints, i.e., $I =
\{1 \le i \le N : i \in \bigcup_{t=1}^{M} I_t\}$, and use $U$ to index the instances
that are not in any constraints.

III. Methodology 

In this section, we introduce our probabilistic model and 
present the proposed objective functions based on this model. 

A. The Probabilistic Model 

We propose a Discriminative Clustering model for Relative
Constraints (DCRC). Figure 3 shows the proposed probabilistic
model defining the dependencies between the input
instances $(x_{t_1}, x_{t_2}, x_{t_3})$, their cluster labels $(y_{t_1}, y_{t_2}, y_{t_3})$, and
the constraint label $l_t$ for only one relative constraint. For
a collection of constraints, it is possible to have $y$ variables
connected to more than one constraint label $l$ (or to none) if some
instances appear in multiple constraints (or do not appear in
any given constraint).

TABLE I. Distribution of $P(l_t \mid Y_t)$, $Y_t = [y_{t_1}, y_{t_2}, y_{t_3}]$.

Cases                                         l_t = yes        l_t = no         l_t = dnk
$y_{t_1} = y_{t_2},\ y_{t_1} \neq y_{t_3}$    $1 - \epsilon$   $\epsilon/2$     $\epsilon/2$
$y_{t_1} = y_{t_3},\ y_{t_1} \neq y_{t_2}$    $\epsilon/2$     $1 - \epsilon$   $\epsilon/2$
o.w.                                          $\epsilon/2$     $\epsilon/2$     $1 - \epsilon$

We use a multi-class logistic classifier to model the conditional
probability of the $y$’s given the observed $x$’s. For simplicity,
in the following we will use the same notation $x$ to represent
the $(d + 1)$-dimensional augmented vector $[x^T, 1]^T$. Let
$W = [w_1, \ldots, w_K]^T$ be a weight matrix in $\mathbb{R}^{K \times (d+1)}$, where
each $w_k$ contains weights on the $d$-dimensional feature space
and an additional bias term. Then the conditional probability
is represented as

$$
P(y = k \mid x; W) = \frac{\exp(w_k^T x)}{\sum_{k'} \exp(w_{k'}^T x)}. \qquad (6)
$$
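A minimal sketch of Eq. (6) in plain Python follows. The function name `cluster_posterior` and the list-of-lists layout for $W$ are assumptions made for this example; subtracting the maximum score is a standard numerical-stability step that does not change the result.

```python
import math


def cluster_posterior(x, W):
    """Return [P(y = k | x; W) for k = 1..K] as in Eq. (6).

    x: list of d feature values (the bias entry 1 is appended here).
    W: list of K weight vectors, each of length d + 1 (last entry is the bias).
    """
    x_aug = list(x) + [1.0]
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x_aug)) for w in W]
    top = max(scores)                       # shift scores to avoid overflow in exp
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```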


In our model, the observed constraint labels only depend
on the cluster labels of the associated instances. In an ideal
scenario, the conditional distribution of $l_t$ given the cluster
labels would be deterministic, as described by Eq. (1). However,
in practice users can make mistakes and be inconsistent
during the annotation process. We address this by relaxing the
deterministic relationship to the distribution $P(l_t \mid Y_t)$ described
in Table I. The relaxation is parameterized by $\epsilon \in [0, 1)$,
indicating the probability of an error when answering the
query. Here we let the two erroneous answers have equal
probability $\epsilon/2$. Namely, the ideal label of $l_t$ (e.g., $l_t = \text{yes}$
if $y_{t_1} = y_{t_2}$ and $y_{t_1} \neq y_{t_3}$) is given with probability $1 - \epsilon$,
and any other labels (no and dnk in this case) are given
with equal probability $\epsilon/2$. In practice, lower values of $\epsilon$
are expected when the constraints have less noise. Alternatively,
we can view this relaxation as allowing the constraints to
be soft as needed, balancing the trade-off between finding
large separation margins among clusters and satisfying all the
constraints.
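The relaxed label distribution of Table I can be written as a short function. This is only an illustrative sketch; the name `constraint_label_probs` is hypothetical.

```python
def constraint_label_probs(y1, y2, y3, eps):
    """P(l_t | Y_t) from Table I for cluster labels (y_t1, y_t2, y_t3) and error rate eps."""
    if y1 == y2 and y1 != y3:
        ideal = "yes"
    elif y1 == y3 and y1 != y2:
        ideal = "no"
    else:
        ideal = "dnk"
    # The ideal label gets 1 - eps; the two erroneous labels share eps equally.
    return {label: (1.0 - eps if label == ideal else eps / 2.0)
            for label in ("yes", "no", "dnk")}
```

With `eps = 0` this reduces to the deterministic rule of Eq. (1).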


B. Objective 

The first part of our objective is to maximize the likelihood
of the observed constraints given the instances, i.e.,

$$
\max_W\ \Phi(L \mid X_I; W) = \frac{1}{M}\log P(L \mid X_I; W)
= \frac{1}{M}\log \sum_{Y_I} P(L, Y_I \mid X_I; W), \qquad (7)
$$

where $I$ indexes the constrained instances as defined in Section
II-B and $\frac{1}{M}$ is a normalization constant.

To reduce overfitting, we add the standard L2 regularization
for the logistic model, namely,

$$
R(W) = \sum_k \tilde{w}_k^T \tilde{w}_k,
$$

where each $\tilde{w}_k$ is a vector obtained by replacing the bias term
in $w_k$ with 0.

In addition to satisfying the constraints, we also expect the
clustering solution to separate the clusters with large margins.
This objective can be captured by minimizing the conditional
entropy of instance cluster labels given the observed features
ED. Since the cluster information about constrained instances
is captured by Eq. (7), we only impose such entropy minimization
on the unconstrained instances, i.e.,

$$
H(Y_U \mid X_U; W) = \frac{1}{|U|}\sum_{i \in U} H[P(y_i \mid x_i; W)].
$$

Adding the above terms together, our objective is

$$
\max_W\ \Phi(L \mid X_I; W) - \tau H(Y_U \mid X_U; W) - \lambda R(W). \qquad (8)
$$

In some cases, we may also wish to maintain a balanced
distribution across different clusters. This can be achieved by
maximizing the entropy of the estimated marginal distribution
of cluster labels [20], i.e.,

$$
H(y \mid X; W) = -\sum_{k=1}^{K} p_k \log p_k,
$$

where we denote the estimated marginal probability as $p_k =
P(y = k \mid X; W) = \frac{1}{N}\sum_{i=1}^{N} p_{ik}$ and $p_{ik} = P(y_i = k \mid x_i; W)$.

In cases where balanced clusters are desired, our objective
is formulated as

$$
\max_W\ \Phi(L \mid X_I; W) - \lambda R(W)
+ \tau\,[H(y \mid X; W) - H(Y_U \mid X_U; W)], \qquad (9)
$$

where we use the same coefficient $\tau$ to control the enforcement
of the cluster separation and cluster balance terms, since they
are roughly at the same scale.
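To make the two entropy terms concrete, the sketch below computes the cluster-separation term $H(Y_U \mid X_U; W)$ and the cluster-balance term $H(y \mid X; W)$ from rows of predicted probabilities $p_{ik}$. The function names and the list-of-rows input format are assumptions for illustration.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution, treating 0 * log 0 as 0."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)


def conditional_entropy(P_unconstrained):
    """Average per-instance entropy over the unconstrained instances (separation term)."""
    return sum(entropy(p) for p in P_unconstrained) / len(P_unconstrained)


def marginal_entropy(P_all):
    """Entropy of the estimated marginal p_k = (1/N) * sum_i p_ik (balance term)."""
    n = len(P_all)
    k = len(P_all[0])
    marginal = [sum(p[j] for p in P_all) / n for j in range(k)]
    return entropy(marginal)
```

Confident predictions spread evenly over the clusters give a low separation term and a high balance term, which is the combination that objective (9) rewards.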

The two objectives (8) and (9) are non-concave, and optimization
generally can only be guaranteed to reach a local
optimum. In the next section, we present a variational EM
solution and discuss an effective initialization strategy.


IV. Optimization 

Here we consider optimizing the objective in Eq. (9), which
enforces cluster balance. The objective (8) is simpler and can
be optimized following the same procedure by simply removing
the corresponding terms employed for cluster balance.

Computing the log-likelihood in Eq. (7) requires marginalizing
over the hidden variables $Y_I$. Exact inference may be feasible
when the constraints are highly separated or the number of
constraints is small, as this may produce a graphical model
with low tree-width. As more $y$’s are related to each other
via constraints, marginalization becomes more expensive to
compute, and it is in general intractable. For this reason, we
use the variational EM algorithm for optimization.

Applying Jensen’s inequality, we obtain the lower bound of
the objective as follows

$$
LB = \frac{1}{M} E_{Q(Y_I)}\left[\log \frac{P(L, Y_I \mid X_I; W)}{Q(Y_I)}\right]
- \lambda R(W) + \tau\,[H(y \mid X; W) - H(Y_U \mid X_U; W)], \qquad (10)
$$

where $Q(Y_I)$ is a variational distribution. In variational EM,
this lower bound is maximized alternately in the E-step
and the M-step ED. In each E-step, we aim to
find a tractable distribution $Q(Y_I)$ such that the Kullback-Leibler
divergence between $Q(Y_I)$ and the posterior distribution
$P(Y_I \mid L, X_I; W)$ is minimized. Given the current $Q(Y_I)$,
each M-step finds the new $W$ that maximizes the LB. Note
that in the objective (and the LB), only the likelihood term is
relevant to the E-step. The other terms are only used in solving
for $W$ in the M-steps.

TABLE II. The values of $Q(l_t \mid y_i = k)$, $i \in I_t$. For simplicity, we
denote $q_{jk} = q(y_{t_j} = k)$ and $q_{j\neg k} = q(y_{t_j} \neq k)$.

Cases       l_t = yes                         l_t = no                          l_t = dnk
$i = t_1$   $q_{2k}\, q_{3\neg k}$            $q_{2\neg k}\, q_{3k}$            $1 - q_{2k}\, q_{3\neg k} - q_{2\neg k}\, q_{3k}$
$i = t_2$   $q_{1k}\, q_{3\neg k}$            $\sum_{u \neq k} q_{1u} q_{3u}$   $1 - q_{1k}\, q_{3\neg k} - \sum_{u \neq k} q_{1u} q_{3u}$
$i = t_3$   $\sum_{u \neq k} q_{1u} q_{2u}$   $q_{1k}\, q_{2\neg k}$            $1 - q_{1k}\, q_{2\neg k} - \sum_{u \neq k} q_{1u} q_{2u}$


A. Variational E-Step 

We use mean field inference [22], [23] to approximate the
posterior distribution, in part due to its ease of implementation
and convergence properties [24]. Mean field restricts the
variational distribution $Q(Y_I)$ to the tractable fully-factorized
family $Q(Y_I) = \prod_{i \in I} q(y_i)$, and finds the $Q(Y_I)$ that minimizes
the KL-divergence $KL[Q(Y_I)\,\|\,P(Y_I \mid L, X_I; W)]$. The
optimal $Q(Y_I)$ is obtained by iteratively updating each $q(y_i)$
until $Q(Y_I)$ converges. The update equation is

$$
q(y_i) = \frac{1}{Z}\exp\left\{E_{Q(Y_{I \setminus i})}[\log P(X_I, Y_I, L)]\right\}, \qquad (11)
$$

where $Q(Y_{I \setminus i}) = \prod_{j \in I,\, j \neq i} q(y_j)$ and $Z$ is a normalization
factor to ensure $\sum_{y_i} q(y_i) = 1$. In the following, we derive a
closed-form update for this optimization problem.

Applying the model independence assumptions, the expectation
term in Eq. (11) is simplified to

$$
E_{Q(Y_{I \setminus i})}\Big[\sum_{t=1}^{M}\log P(l_t \mid Y_t) + \sum_{j \in I}\log P(y_j \mid x_j; W) + \log P(X_I)\Big]
= \sum_{t:\, i \in I_t} E_{Q(Y_{I_t \setminus i})}[\log P(l_t \mid Y_t)] + \log P(y_i \mid x_i; W) + \text{const}, \qquad (12)
$$

where $I_t \setminus i$ is the set of indices in $I_t$ except for $i$, and const
absorbs all the terms that are constant with respect to $y_i$.
The first term in (12) sums over the expected log-likelihood
of observing each $l_t$ given the fixed $y_i$. To compute the
expectation, we first let $Q(l_t \mid y_i)$ be the probability that the
observed $l_t$ is consistent with $Y_t$ given a fixed $y_i$. That is,
$Q(l_t \mid y_i)$ is the total probability of all possible assignments of $Y_t$,
given a fixed $y_i$, such that $P(l_t \mid Y_t) = 1 - \epsilon$ according to Table
I. The value of $Q(l_t \mid y_i)$ can be computed straightforwardly as in Table
II. Then each of the expectations in (12) is computed as

$$
E_{Q(Y_{I_t \setminus i})}[\log P(l_t \mid Y_t)] = [1 - Q(l_t \mid y_i)]\log\frac{\epsilon}{2} + Q(l_t \mid y_i)\log(1 - \epsilon).
$$

From the above, the update Eq. (11) is derived as

$$
q(y_i) = \frac{\alpha^{F(y_i)}\, P(y_i \mid x_i; W)}{\sum_{y_i} \alpha^{F(y_i)}\, P(y_i \mid x_i; W)},
\quad \text{with } \alpha = \frac{2(1 - \epsilon)}{\epsilon}, \qquad (13)
$$

where $F(y_i) = \sum_{t:\, i \in I_t} Q(l_t \mid y_i)$.

The term $F(y_i)$ can be interpreted as measuring the compatibility
of each assignment of $y_i$ with respect to the constraints
and the other $y$’s. In Eq. (13), $\alpha$ is controlled by the parameter
$\epsilon$. When $\epsilon \in (0, \frac{2}{3})$, $\alpha > 1$ and the update allows more
compatible assignments of $y_i$, i.e., the ones with higher $F(y_i)$,
to have larger $q(y_i)$. When $\epsilon = \frac{2}{3}$, the constraint labels are
regarded as uniformly distributed regardless of the instance
cluster labels, as can be seen from Table I. In this case, $\alpha = 1$
and each $q(y_i)$ is directly set to the conditional probability
$P(y_i \mid x_i; W)$. This naturally reduces our method to learning
without constraints. Clearly, when $\epsilon$ is smaller, the constraints
are harder and the updates will push $q(y_i)$ to more extreme
distributions. Note that values of $\epsilon \in (\frac{2}{3}, 1)$ cause $\alpha < 1$,
which will lead to results that contradict the constraints, and
are generally not desired.
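The closed-form update of Eq. (13) can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the 0-based slot index `pos`, and the tuple format for constraints are all hypothetical. The first function reproduces the entries of Table II; the second combines them into $F(y_i)$ and applies the update, assuming $0 < \epsilon < 1$.

```python
def consistency_prob(pos, label, k, q1, q2, q3):
    """Q(l_t | y_i = k) from Table II.

    pos: slot of instance i in the triplet (0 for t_1, 1 for t_2, 2 for t_3).
    q1, q2, q3: current mean-field distributions of the three triplet members
    (the entry belonging to instance i itself is not used).
    """
    K = len(q1)
    if pos == 0:
        p_yes = q2[k] * (1.0 - q3[k])
        p_no = (1.0 - q2[k]) * q3[k]
    elif pos == 1:
        p_yes = q1[k] * (1.0 - q3[k])
        p_no = sum(q1[u] * q3[u] for u in range(K) if u != k)
    else:
        p_yes = sum(q1[u] * q2[u] for u in range(K) if u != k)
        p_no = q1[k] * (1.0 - q2[k])
    return {"yes": p_yes, "no": p_no, "dnk": 1.0 - p_yes - p_no}[label]


def mean_field_update(prior, constraints, eps):
    """One update of q(y_i) following Eq. (13).

    prior: [P(y_i = k | x_i; W) for each k].
    constraints: list of (pos, label, q1, q2, q3), one per triplet containing i.
    eps: error probability, assumed to satisfy 0 < eps < 1.
    """
    K = len(prior)
    alpha = 2.0 * (1.0 - eps) / eps
    F = [sum(consistency_prob(pos, label, k, q1, q2, q3)
             for pos, label, q1, q2, q3 in constraints) for k in range(K)]
    unnorm = [alpha ** F[k] * prior[k] for k in range(K)]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

In a full E-step this update would be swept over all constrained instances repeatedly until the distributions stop changing; with `eps = 2/3` it returns the prior unchanged, as described above.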

Special Case: Hard Constraints. In the special case where
$\epsilon = 0$ and $\alpha = \infty$, $P(l_t \mid Y_t)$ essentially reduces to the
deterministic model described in Eq. (1), allowing our model to
incorporate hard constraints. The update equation for this case
can be derived similarly to Eq. (13). In this case, $q(y_i)$ is
non-zero only when the value of $F(y_i)$ is the maximum among
all possible assignments of $y_i$. Thus, the update equation is
reduced to a max model. More formally, we define the max-compatible
label set for each instance $x_i$ as

$$
\mathcal{Y}_i = \{1 \le k \le K : F(y_i = k) \ge F(y_i = k'),\ \forall k' \neq k\}.
$$

Namely, each $\mathcal{Y}_i$ contains the most compatible assignments for
$y_i$ with respect to the constraints. Then the update equation
becomes

$$
q(y_i) = \begin{cases}
\dfrac{P(y_i \mid x_i; W)}{\sum_{y'_i \in \mathcal{Y}_i} P(y'_i \mid x_i; W)}, & \text{if } y_i \in \mathcal{Y}_i, \\
0, & \text{o.w.}
\end{cases}
\qquad (14)
$$
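The hard-constraint update of Eq. (14) amounts to renormalizing the classifier probabilities over the labels that attain the maximum compatibility. A minimal sketch follows; the function name `hard_update` and the tolerance `tol` used to detect ties in floating point are assumptions of this example, not part of the method as stated.

```python
def hard_update(prior, F, tol=1e-12):
    """Update q(y_i) following Eq. (14).

    prior: [P(y_i = k | x_i; W) for each k].
    F: [F(y_i = k) for each k], the compatibility scores.
    tol: hypothetical tolerance for treating scores as tied at the maximum.
    """
    best = max(F)
    keep = [k for k in range(len(F)) if F[k] >= best - tol]  # max-compatible label set
    total = sum(prior[k] for k in keep)
    return [prior[k] / total if k in keep else 0.0 for k in range(len(F))]
```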


B. M-Step 

The M-step searches for the parameter $W$ that maximizes
the LB. Applying the independence assumptions again and
ignoring all the terms that are constant with respect to $W$, we
obtain the following objective

$$
\max_W\ J = \frac{1}{M} E_{Q(Y_I)}[\log P(Y_I \mid X_I; W)] - \lambda R(W)
+ \tau\,[H(y \mid X; W) - H(Y_U \mid X_U; W)].
$$

This objective is non-concave and a local optimum can
be found via gradient ascent. We used L-BFGS [25] in our
experiments. The derivative of $J$ w.r.t. $W$ is

$$
\frac{\partial J}{\partial W} = \frac{1}{M}\sum_{i \in I}(Q_i - P_i)\,x_i^T - 2\lambda\tilde{W}
+ \frac{\tau}{|U|}\sum_{j \in U}\sum_k (\mathbf{1}_k - P_j)\,p_{jk}\log p_{jk}\,x_j^T
- \frac{\tau}{N}\sum_{n=1}^{N}\sum_k (\mathbf{1}_k - P_n)\,p_{nk}\log p_k\,x_n^T,
$$

where $P_i = [p_{i1}, \ldots, p_{iK}]^T$, $Q_i = [q_{i1}, \ldots, q_{iK}]^T$ with $q_{ik} =
q(y_i = k)$, $\tilde{W} = [\tilde{w}_1, \ldots, \tilde{w}_K]^T$, and $\mathbf{1}_k$ is a $K$-dimensional
vector that contains the value 1 on the $k$-th dimension and 0
elsewhere.

The above derivations use a linear model for P(y |x; W), 
and thus the learned DCRC is also linear. However, all of 
the results can be easily generalized to using kernel functions, 
allowing DCRC to find non-linear separation boundaries. 

C. Complexity and Initialization 

In each E-step, the complexity is $O(\gamma K |I|)$, where $\gamma$ is the
number of mean-field iterations for $Q(Y_I)$ to converge. In the
M-step, the complexity of computing the gradient of $W$ in
each L-BFGS iteration is $O(NKd)$.

Although mean-field approximation is guaranteed to converge,
in the first few E-steps it is not critical to achieve a very
close approximation. In practice, we can run the mean-field update
up to a fixed number of iterations (e.g., 100). We empirically
observe that the approximation still converges very fast in later
EM iterations. Similarly, we observe in the M-step that the L-BFGS
optimization usually converges within very few iterations
in the later EM runs, and a completion of a fixed number of
iterations for L-BFGS is also sufficient in the first few M-steps.

The EM algorithm is generally sensitive to the initial 
parameter values. Here we first apply Kmeans and train a 
supervised logistic classifier with the clustering results. The 
learned weights are then used as the starting point of DCRC. 
Empirically we observe that such initialization typically allows 
DCRC to converge within 100 iterations. 

V. Experiments 

In this section, we experimentally examine the effectiveness
of our model in utilizing relative constraints to improve clustering.
We first evaluate all methods on both UCI and other
real-world datasets with noise-free constraints generated from 
true class labels. We then present a preliminary user study 
where we ask users to label constraints and evaluate all the 
methods on these human-labeled (noisy) constraints. 

A. Baseline Methods and Evaluation Metric 

We compare our algorithm with existing methods that consider
relative constraints or pairwise constraints. The methods
employing pairwise constraints are Xing’s method [2] (distance
metric learning for a diagonal matrix) and ITML [26]. These
are the state-of-the-art methods that are usually compared in
the literature and have publicly available source code.

For methods considering relative constraints, we compare
with: 1) LSML (TD, a very recent metric learning method
studying relative constraints (we use Euclidean distance as the
prior); 2) SSSVaD [16], a method that directly finds clustering
solutions with relative constraints; and 3) sparseLP [13], an
earlier method that hasn’t been extensively compared. We also
experimented with an SVM-style method proposed in [12] and
observed that its performance is generally worse. Thus, we do
not report the results on this method.

Fig. 4. (Best viewed in color.) The F-measure as a function of the number of relative constraints. Results are averaged over 20 runs with independently sampled
constraints. Error bars are shown as 95% confidence intervals. Recoverable panel titles: (a) Ionosphere, (g) Stonefly9, (h) Birdsong. Legend: DCRC, DCRC-YN, LSML, sparseLP, SSSVaD, ITML, Xing.

TABLE III. Summary of Dataset Information

Dataset          #Inst.   #Dim.   #Cluster
Ionosphere       351      34      2
Pima             768      8       2
Balance-scale    625      4       3
Digits-389       3165     16      3
Letters-IJLT     3059     16      4
MSRCv2           1046     48      6
Stonefly9        3824     285     9
Birdsong         4998     38      13




Fig. 5, panels (a) and (b): (a) Data1: Soft Const. (88.33%); (b) Data1: Hard Const. (83.33%).


Xing’s method, ITML, LSML, and sparseLP are metric
learning techniques. Here we apply Kmeans with the learned
metric (50 times) to form cluster assignments, and the clustering
solution with the minimum mean-squared error is chosen.

We evaluated the clustering results against the ground-truth
class labels using pairwise F-measure, Adjusted Rand Index,
and Normalized Mutual Information. The results are highly
similar across the different measures, so we only present the
F-measure results.
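For reference, one common definition of the pairwise F-measure treats every pair of instances as a binary decision (same cluster or not) and combines pairwise precision and recall. The sketch below follows that definition; it is an assumption that this matches the exact variant used in the experiments, and the function name is hypothetical.

```python
from itertools import combinations


def pairwise_f_measure(true_labels, pred_labels):
    """Harmonic mean of pairwise precision and recall between two labelings."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_pred and same_true:
            tp += 1        # pair correctly placed in the same cluster
        elif same_pred:
            fp += 1        # pair wrongly merged
        elif same_true:
            fn += 1        # pair wrongly separated
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

The measure is invariant to how cluster ids are named, so a clustering that matches the classes up to relabeling scores 1.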


B. Controlled Experiments 

In this set of experiments, we use simulated noise-free 
constraints to evaluate all the methods. 

1) Datasets: We evaluate all methods on five UCI datasets:
Ionosphere, Pima, Balance-Scale, Digits-389, and Letters-IJLT.
We also use three extra real-world datasets: 1) a subset of
image segments of the MSRCv2 data (see footnote 1), which contains the six
largest classes of the image segments; 2) the HJA Birdsong
data 03, which contains automatically extracted segments
from spectrograms of birdsong recordings, and the goal is to
identify the species for each segment; and 3) the Stonefly9 data
[28], which contains insect images and the task is to identify
the species of the insect for each image. Table III summarizes
the dataset information. In our experiments, all features are
standardized to have zero mean and unit standard deviation.

2) Experimental Setup: For each dataset, we vary the
number of constraints from $0.05N$ to $0.3N$ with a $0.05N$
increment, where $N$ is the total number of instances. For each
size, triplets are randomly generated and constraint labels are
assigned according to Eq. (1). We evaluated our method in two
settings, one with all constraints as input (shown as DCRC),
and the other with only yes/no constraints (shown as DCRC-YN).
The baseline methods for relative constraints are designed
for yes/no constraints only and cannot be easily extended to
incorporate dnk constraints, so we drop the dnk constraints for
these methods. To form the corresponding pairwise constraints,
we infer one ML and one CL constraint from each relative
constraint with a yes/no label (note that no pairwise constraints
can be directly inferred from dnk relative constraints). Thus,
all the baselines use the same information as DCRC-YN, since
no dnk constraints are employed by them.
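The conversion from relative to pairwise constraints can be sketched as follows. The text states only that one ML and one CL constraint are inferred per yes/no triplet; pairing the CL constraint with the first instance, as done here, is an assumption of this example, and the function name is hypothetical.

```python
def to_pairwise(triplet, label):
    """Convert a labeled relative constraint on indices (t1, t2, t3) to pairwise constraints."""
    t1, t2, t3 = triplet
    if label == "yes":   # t1 and t2 share a cluster; t3 is in a different one
        return [("ML", t1, t2), ("CL", t1, t3)]
    if label == "no":    # t1 and t3 share a cluster; t2 is in a different one
        return [("ML", t1, t3), ("CL", t1, t2)]
    return []            # dnk: nothing can be inferred directly
```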

We use five-fold cross-validation to tune parameters for
all methods. The same training and validation folds are used
across all the methods (removing dnk constraints, or converting
to pairwise constraints when necessary). For each method,
we select the parameters that maximize the averaged constraint
prediction accuracy on the validation sets. For our
method, we search for the optimal $\tau \in \{0.5, 1, 1.5\}$ and
$\lambda \in \{2^{-10}, 2^{-8}, 2^{-6}, 2^{-4}, 2^{-2}\}$. We empirically observed
that our method is very robust to the choice of $\epsilon$ when it
is within the range $[0.05, 0.15]$. Here we set $\epsilon = 0.05$ for
this set of experiments with the simulated noise-free constraints.
Experiments are repeated using 20 randomized runs, each with
independently sampled constraints.

1 http://research.microsoft.com/en-us/projects/ObjectClassRecognition/

Fig. 5. Estimated entropy using soft/hard constraints on synthetic datasets. Panels (c) and (d): (c) Data2: Soft Const. (100%); (d) Data2: Hard Const. (100%).
Cluster assignments are represented with blue, pink, and green points. Entropy
regions are shaded, with darker color representing higher entropy. Prediction
accuracy on instance cluster labels is shown in the parentheses.

3) Overall Performance: Figure 4 shows the performance of
all methods with different numbers of constraints. The sparseLP
method does not scale to the high-dimensional Stonefly9 dataset and
hence is not reported on this particular data.

From the results we see that DCRC consistently outperforms
all baselines on all datasets as the number of constraints increases,
demonstrating the effectiveness of our method.

Comparing DCRC with DCRC-YN, we observe that the
additional dnk constraints provide substantial benefits, especially
for datasets with a large number of clusters (e.g., MSRCv2,
Birdsong). This is consistent with our expectation because the
portion of dnk constraints increases significantly when $K$ is
large, leading to more information to be utilized by DCRC.

Comparing DCRC-YN with the baselines, we observe that 
DCRC-YN achieves comparable or better performance even 
compared with the best baseline ITML. This suggests that, with 
noise-free constraints, our model is competitive with the state- 
of-the-art methods even without considering the additional 
information provided by dnk constraints. 

4) Soft Constraints vs. Hard Constraints: In this set of
experiments, we explore the impact on our model when soft
constraints ($\epsilon = 0.05$) and hard constraints ($\epsilon = 0$) are used,
respectively. We first use two synthetic datasets to examine
and illustrate their different behaviors. These two datasets
each contain three clusters, with 50 instances per cluster. The
clusters are close to each other in one dataset, and far apart
(and thus easily separable) in the other. For each dataset,
we randomly generated 500 relative constraints using points
near the decision boundaries. Figure 5 shows the prediction
entropy and prediction accuracy on instance cluster labels for
both datasets achieved by our model, using soft and hard constraints,
respectively. We can see that when clusters are easily
separable, both soft and hard constraints produce reasonable
decision boundaries and perfect prediction accuracy. However,
when cluster boundaries are fuzzy, the results of using soft
constraints appear preferable. This indicates that by softening
the constraints, our method can find more reasonable
decision boundaries and avoid overfitting to the constrained instances.

Fig. 6. Performance of DCRC using soft constraints vs. hard constraints.

We then compare the performance of using soft ($\epsilon = 0.05$)
versus hard ($\epsilon = 0$) constraints on real datasets with the
same setting used in Section IV-B3. Due to space limits, here we
only show results on four representative datasets in Figure 6. The
behavior on the other datasets is similar. We can see that using
soft constraints generally leads to better performance than using
hard constraints. In particular, on the MSRCv2 dataset, using hard
constraints produces a large “dip” at the beginning of the curve,
while this issue is not severe for soft constraints. This suggests
that using soft constraints makes our model less susceptible to
overfitting to small sets of constraints.

5) Effect of Cluster Balance Enforcement: This set of experiments
tests the effect of the cluster balance enforcement on the
performance of DCRC for the unbalanced Birdsong and the balanced
Letters-IJLT datasets. Figure 7 reports the performance of DCRC
(soft constraints, $\epsilon = 0.05$) with and without such
enforcement for varying numbers of constraints. We see that when
there are no constraints, it is generally beneficial to enforce the
cluster balance. The reason is that, when cluster balance is not
enforced, the entropy that enforces cluster separation can be
trivially reduced by removing cluster boundaries, causing degenerate
solutions. However, as the number of constraints increases,
enforcing cluster balance on the unbalanced Birdsong dataset hurts
the performance. Conceivably, such enforcement would cause DCRC to
prefer solutions with balanced cluster distributions, which is
undesirable for datasets with uneven classes. On the other hand,
appropriate enforcement on the balanced Letters-IJLT dataset
provides further improvement. In practice, one could determine
whether to enforce cluster balance based on prior knowledge of the
application domain.

Fig. 7. Performance of DCRC with/without cluster balance enforcement.
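
The degenerate solution mentioned above can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than the DCRC objective itself: the cluster posteriors are hand-written toy values, and `mean_conditional_entropy` and `marginal_entropy` are hypothetical helper names for the separation term (to be minimized) and the balance term (to be maximized), respectively.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def mean_conditional_entropy(posteriors):
    """Average entropy of the per-instance cluster posteriors."""
    return sum(entropy(p) for p in posteriors) / len(posteriors)


def marginal_entropy(posteriors):
    """Entropy of the cluster distribution averaged over all instances."""
    k = len(posteriors[0])
    n = len(posteriors)
    marginal = [sum(p[j] for p in posteriors) / n for j in range(k)]
    return entropy(marginal)


# Toy posteriors for six instances and three clusters (assumed values).
DEGENERATE = [[1.0, 0.0, 0.0]] * 6  # every instance in cluster 0
BALANCED = ([[1.0, 0.0, 0.0]] * 2
            + [[0.0, 1.0, 0.0]] * 2
            + [[0.0, 0.0, 1.0]] * 2)

if __name__ == "__main__":
    for name, post in (("degenerate", DEGENERATE), ("balanced", BALANCED)):
        print(name, mean_conditional_entropy(post), marginal_entropy(post))
```

Both assignments are fully confident, so the separation term alone cannot tell them apart; only the marginal entropy, which is zero for the collapsed solution and $\log 3$ for the balanced one, penalizes the collapse.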

6) Computational Time: We record the runtime of learning with 1500
constraints on the Birdsong dataset on a standard desktop computer
with a 3.4 GHz CPU and 11.6 GB of memory. On average, it takes less
than 2 minutes to train the model using an unoptimized Matlab
implementation. This is reasonable for most applications of similar
scale.

C. Case Study: Human-labeled Constraints 

We now present a case study where we investigate the 
impact of human-labeled constraints on the proposed method 
and its competitors. 

1) Dataset and Setup: This case study is situated in one of our
applications where the goal is to find bird singing patterns by
clustering. The Birdsong dataset used in Section IV-B contains
spectrogram segments labeled with bird species. In reality, birds of
the same species may vocalize in different patterns, which we hope
to identify as different clusters. Toward this goal, we created
another birdsong dataset consisting of clusters that contain
relatively pure singing patterns. We briefly describe the data
generation process as follows.

We first manually selected a collection of representative examples
of the singing patterns, and then used them as templates to extract
segments from birdsong spectrograms by applying template matching.
Each extracted segment was assigned to the cluster represented by
the corresponding template. We then manually inspected and edited
the clusters to ensure their quality. As a result, each cluster
contains relatively pure segments that are actually from the same
bird species and represent the same vocalization pattern. See
Figure 7 for examples of several different vocalization patterns,
which we refer to as syllables. We extract features for each segment
using the same method as described in [27]. This process results in
a new Birdsong dataset containing 2601 instances and 14 ground-truth
clusters.

After obtaining informed consent according to the protocol approved
by the Institutional Review Board of our institution, we tested six
human subjects’ behaviors on labeling constraints. None of the users
had any prior experience with or knowledge of the data. They were
first given a short tutorial on the data and the concepts of
clustering and constraints. Then each user was asked to label 150
randomly selected triplets and 225 randomly selected pairs, using a
graphical interface that displays the spectrogram segments. To
neutralize the potential bias introduced by the task ordering
(triplets vs. pairs), we randomly split the users into two groups,
with each group using a different ordering.

TABLE IV. The average confusion matrix of the human-labeled
constraints vs. the constraint labels inferred from true instance
clusters.

(a) Relative Constraints

                Human Labels
True         yes       no      dnk
yes        18.50     0.33     4.83
no          0.33    16.50     4.50
dnk         3.50     3.00    98.50

(b) Pairwise Constraints

                Human Labels
True          ML       CL
ML         42.50     9.67
CL         10.83   162.00

TABLE V. F-measure performance (mean ± std) with the human-labeled
constraints.

(a) Without using constraints

Method          F-Measure
DCRC-NoConst    0.5175 ± 0.0232
Kmeans          0.6523 ± 0.0189

(b) Using 150 relative constraints

Method          F-Measure
DCRC            0.7620 ± 0.1335
DCRC-YN         0.7635 ± 0.1067
LSML            0.6409 ± 0.0654
sparseLP        0.5200 ± 0.0706
SSSVaD          0.6046 ± 0.0605

(c) Using 150 pairwise constraints

Method          F-Measure
ITML            0.6409 ± 0.0424
Xing            0.6438 ± 0.0423

(d) Using 225 pairwise constraints

Method          F-Measure
ITML            0.6347 ± 0.0372
Xing            0.6438 ± 0.0282

2) Results and Discussion: Table IV lists the average confusion
matrix of the human-labeled constraints versus the labels produced
based on the ground-truth cluster labels. From Table IV(a) we see
that the dnk constraints make up more than half of the relative
constraints, which is consistent with our analysis in Section III-A
that the number of dnk constraints can be dominantly large. The
users rarely confuse the yes and no labels, but they do tend to
provide more erroneous dnk labels. This phenomenon is not surprising
because, when in doubt, we are often more comfortable abstaining
from giving a definite yes/no answer and resorting to the dnk
option.

For pairwise constraints, the CL constraints are the majority, and
the confusions for both CL and ML are similar. We note that the
confusion between the yes/no constraints is much smaller than that
between the ML/CL constraints. This shows that the increased
flexibility introduced by the dnk label allows the users to more
accurately differentiate yes/no labels. The overall labeling
accuracy of pairwise constraints is slightly higher than that of
relative constraints. We suspect that this is due to the presence
of the large number of dnk constraints.
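
The comparisons drawn above can be recomputed directly from the averaged counts in Table IV. The following is a small sketch for that purpose; the dictionary layout and the helper names `accuracy` and `confusion_rate` are illustrative choices, and only the numbers themselves come from the table.

```python
# Averaged confusion matrices from Table IV.
# Outer key: true label; inner key: human label.
RELATIVE = {
    "yes": {"yes": 18.50, "no": 0.33, "dnk": 4.83},
    "no": {"yes": 0.33, "no": 16.50, "dnk": 4.50},
    "dnk": {"yes": 3.50, "no": 3.00, "dnk": 98.50},
}
PAIRWISE = {
    "ML": {"ML": 42.50, "CL": 9.67},
    "CL": {"ML": 10.83, "CL": 162.00},
}


def total(matrix):
    """Total number of constraints represented by the matrix."""
    return sum(sum(row.values()) for row in matrix.values())


def accuracy(matrix):
    """Fraction of constraints whose human label equals the true label."""
    return sum(matrix[label][label] for label in matrix) / total(matrix)


def confusion_rate(matrix, true_label, human_label):
    """Fraction of constraints with `true_label` that were given `human_label`."""
    return matrix[true_label][human_label] / sum(matrix[true_label].values())


if __name__ == "__main__":
    print("relative accuracy:", round(accuracy(RELATIVE), 3))
    print("pairwise accuracy:", round(accuracy(PAIRWISE), 3))
    print("true yes labeled no:", round(confusion_rate(RELATIVE, "yes", "no"), 3))
    print("true ML labeled CL:", round(confusion_rate(PAIRWISE, "ML", "CL"), 3))
    print("true dnk share:", round(sum(RELATIVE["dnk"].values()) / total(RELATIVE), 3))
```

With these counts, the overall labeling accuracy is roughly 0.89 for relative constraints and 0.91 for pairwise constraints, while a true yes is labeled no far less often than a true ML is labeled CL.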

We evaluated all the methods using these human-labeled constraints.
To account for the labeling noise in the constraints, we set
$\epsilon = 0.15$ for DCRC and DCRC-YN.² The averaged results for
all methods are listed in Table V. We observe that while the
performance of most competing methods degrades with added
constraints compared with unsupervised Kmeans, our method still
shows significant performance improvement even with the noisy
constraints. We want to point out that the performance difference we
observe is not due to the use of the multi-class logistic
classifier. In particular, as shown in Table V(a), without
considering any constraints, the logistic model achieves
significantly lower performance than Kmeans. This further
demonstrates the effectiveness of our method in utilizing the side
information provided by noisy constraints to improve clustering.

² For these noisy constraints, our method remains robust to the
choice of $\epsilon$. Using different values of $\epsilon$ ranging
from 0.05 to 0.2 only introduces minor fluctuations (within 0.01
difference) to the F-measure.

Recall that ITML was competitive with DCRC-YN in the earlier
experiments with noise-free constraints. Here, with noisy
constraints, DCRC-YN achieves a far better F-measure than ITML
(0.7635 vs. 0.6409 in Table V), suggesting that our method is much
more robust to labeling noise. It is also worth noting that although
the dnk constraints tend to be quite noisy, they do not seem to
degrade the performance of DCRC compared with DCRC-YN.

Our case study also points to possible ways to further improve our
model. As revealed by Table IV, the noise on the labels for relative
constraints is not uniform as assumed by our model. An interesting
future direction is to introduce a non-uniform noise process to more
realistically model the users’ labeling behaviors.

VI. Related Work 

Clustering with Constraints: Various techniques have been 
proposed for clustering with pairwise constraints Ei-i!, m. 
Our work is aligned with most of these methods in the sense 
that we assume the guidance for labeling constraints is the 
underlying instance clusters. 

Less work has been done on clustering with relative constraints.
The work in mn-cu proposes metric learning approaches that use
$d(x_i, x_j) < d(x_i, x_k)$ to encode that $x_i$ is more similar to
$x_j$ than to $x_k$, where $d(\cdot)$ is the distance function. The
work [15] studies learning from relative comparisons between two
pairs of instances, which can be viewed as the same type of
constraints when only three distinct examples are involved. By
construction, these methods only consider constraints with yes/no
labels. In practice, such answers might not always be available,
which limits the applicability of these methods. In contrast, our
method is more flexible in that it allows users to provide dnk
constraints.

There also exist studies that encode the instance relative
similarities in the form of hierarchical ordering and attempt
hierarchical algorithms that directly find clustering solutions
satisfying the constraints ED, 05 ). Different from those studies,
our work builds on a natural probabilistic model that has not been
considered for learning with relative constraints.

Semi-supervised Learning: Related work also exists in the much
broader area of semi-supervised learning, involving studies on both
clustering and classification problems. The work [19] proposes that,
to encourage clusters with large separation margins, we could
minimize the entropy on the unlabeled data in addition to learning
from the labeled ones. The study [20] suggests also maximizing the
entropy of the cluster label distribution in order to find balanced
clustering solutions. Our final formulation draws inspiration from
the above work.
VII. Conclusions

In this paper, we studied clustering with relative constraints, 
where each constraint is generated by posing a query: is $x_i$
more similar to $x_j$ than to $x_k$? Unlike existing methods that
only consider yes/no responses to such queries, we studied the 
case where the answer could also be dnk (don’t know). We 
developed a probabilistic method DCRC that learns to cluster 
the instances based on the responses acquired by such queries. 
We empirically evaluated the proposed method using both 
simulated (noise-free) constraints and human-labeled (noisy) 
constraints. The results demonstrated the usefulness of dnk 
constraints, the significantly improved performance of DCRC 
over existing methods, and the superiority of our method in 
terms of the robustness to noisy constraints. 
Appendix

This appendix provides the derivation of the mutual information in
Eq. (3). The derivations for Eqs. (4) and (5) are similar and are
omitted here.

By definition, the mutual information between the instance labels
$Y_t = [y_{t1}, y_{t2}, y_{t3}]$ and the constraint label $l_t$ is

$$I(Y_t; l_t) = H(Y_t) - H(Y_t \mid l_t). \tag{15}$$

The first entropy term is
$H(Y_t) = -\sum_{Y_t} P(Y_t) \log P(Y_t) = 3 \log K$, where we used
the independence assumption $P(Y_t) = \prod_{i=1}^{3} P(y_{ti})$ and
substituted the prior $P(y_{ti} = k) = 1/K$. By definition, the
second entropy term is

$$H(Y_t \mid l_t) = -\sum_{a \in \{\text{yes}, \text{no}, \text{dnk}\}} P(l_t = a) \sum_{Y_t} P(Y_t \mid l_t = a) \log P(Y_t \mid l_t = a).$$

Now we need to compute the marginal distribution $P(l_t)$ and the
conditional distribution $P(Y_t \mid l_t)$. Based on Eq. (1), the
$P(l_t)$ are

$$P(l_t = \text{yes}) = \sum_{Y_t} P(Y_t) P(l_t = \text{yes} \mid Y_t) = \sum_{k=1}^{K} P(y_{t1} = k) P(y_{t2} = k) [1 - P(y_{t3} = k)] = \frac{K-1}{K^2}.$$

By distribution symmetry, $P(l_t = \text{no}) = P(l_t = \text{yes})$.
Then $P(l_t = \text{dnk}) = 1 - P(l_t = \text{yes}) - P(l_t = \text{no}) = 1 - [2(K-1)]/K^2$.
To compute $P(Y_t \mid l_t)$, we notice that for the cluster label
assignments that do not satisfy the conditions for the corresponding
$l_t$ described in Eq. (1), the probability $P(Y_t \mid l_t) = 0$.
For those satisfying such conditions, the $P(Y_t \mid l_t)$ are

$$P(Y_t \mid l_t = \text{yes}) = \frac{P(Y_t) P(l_t = \text{yes} \mid Y_t)}{P(l_t = \text{yes})} = \frac{P(Y_t) \times 1}{P(l_t = \text{yes})} = \frac{1}{K(K-1)}.$$

By symmetry again, $P(Y_t \mid l_t = \text{no}) = P(Y_t \mid l_t = \text{yes})$. Also,

$$P(Y_t \mid l_t = \text{dnk}) = \frac{P(Y_t) P(l_t = \text{dnk} \mid Y_t)}{P(l_t = \text{dnk})} = \frac{P(Y_t) \times 1}{P(l_t = \text{dnk})} = \frac{1}{K[K^2 - 2(K-1)]}.$$

Substituting the values of $P(Y_t \mid l_t)$ and $P(l_t)$, we obtain

$$H(Y_t \mid l_t) = \log K + (1 - P_{dnk}) \log(K-1) + P_{dnk} \log[K^2 - 2(K-1)],$$

where we denote $P_{dnk} = P(l_t = \text{dnk})$.

Substituting $H(Y_t)$ and $H(Y_t \mid l_t)$ into Eq. (15), we derive

$$I(Y_t; l_t) = 2 \log K - (1 - P_{dnk}) \log(K-1) - P_{dnk} \log[K^2 - 2(K-1)]. \qquad \square$$
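
The closed form above can be checked numerically by enumerating all $K^3$ label assignments. The sketch below is a sanity check under stated assumptions, not part of the original derivation: `constraint_label` encodes an assumed reading of the noise-free rule in Eq. (1) (yes when the first two labels agree and the third differs, no when the first and third agree and the second differs, dnk otherwise), and the cluster prior is taken to be uniform as in the derivation.

```python
import itertools
import math


def constraint_label(y1, y2, y3):
    """Assumed noise-free labeling rule (our reading of Eq. (1))."""
    if y1 == y2 and y1 != y3:
        return "yes"
    if y1 == y3 and y1 != y2:
        return "no"
    return "dnk"


def mutual_information_enumerated(K):
    """I(Y_t; l_t) from the definition, by brute-force enumeration (K >= 2)."""
    p_y = 1.0 / K ** 3  # uniform, independent prior over the three labels
    p_l = {"yes": 0.0, "no": 0.0, "dnk": 0.0}
    for y in itertools.product(range(K), repeat=3):
        p_l[constraint_label(*y)] += p_y
    # l_t is a deterministic function of Y_t, so P(Y_t, l_t) = P(Y_t) for the
    # matching label and each term log[P(Y, l) / (P(Y) P(l))] is log[1 / P(l)].
    return sum(
        p_y * math.log(1.0 / p_l[constraint_label(*y)])
        for y in itertools.product(range(K), repeat=3)
    )


def mutual_information_closed_form(K):
    """Closed-form expression derived in the appendix (K >= 2)."""
    p_dnk = 1.0 - 2.0 * (K - 1) / K ** 2
    return (2.0 * math.log(K)
            - (1.0 - p_dnk) * math.log(K - 1)
            - p_dnk * math.log(K ** 2 - 2 * (K - 1)))


if __name__ == "__main__":
    for K in range(2, 8):
        print(K, mutual_information_enumerated(K), mutual_information_closed_form(K))
```

The two functions should agree up to floating-point error for every $K \ge 2$; for $K = 2$ both reduce to $1.5 \log 2$.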



